VRVis - ComVis - VAST 2010 mini challenge 2

VRVis - ComVis

VAST 2010 Challenge
Hospitalization Records – Characterization of Pandemic Spread

Authors and Affiliations:

Zoltán Konyha, VRVis, konyha@vrvis.at [PRIMARY contact]
Andreas Ammer, VRVis, ammer@vrvis.at
Krešimir Matković, VRVis, matkovic@vrvis.at
Çağatay Turkay, University of Bergen, Cagatay.Turkay@ii.uib.no
Denis Gračanin, Virginia Tech, gracanin@vt.edu

Tool(s):

We used ComVis, our interactive visualization application with multiple linked views, in our analysis. ComVis can visualize scalar, categorical and time series data in several different views. Each view is interactive and brushable. Brushes defined in the same view or in different views can be combined using Boolean operators. The visual analysis context is captured in session files. Exchanging session files made collaboration easier among our team members, who are distributed across several cities.

We created a Python script to automate filtering and aggregation tasks. We chose Python because it allows rapid prototyping. The script evolved as part of our analysis and served as a powerful semi-automatic feature extraction tool. By editing the Python code, we could compute any aggregate data we required for the visual analysis in only a few minutes. This flexibility proved especially useful when we required separate symptom statistics before and after day 7 of the time period in the data set to estimate the number of people infected. Overall, this approach was more efficient than implementing similar custom aggregation methods in the C++ code of ComVis or implementing a generic aggregation framework.

We used Microsoft Excel™ to create the bar chart in Figure 1.1 and to compute the numbers of people infected in different cities.

Video:

Download video (9.9 MB)

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

Mortality rate

We first computed and examined various aggregates. The overall mortality rate in each location was computed by dividing the number of death records by the number of hospitalization records. We identified four clusters in Figure 1.1:

~ 3.5%: Aleppo, Nairobi
~ 2.5%: Colombia, Iran, Karachi, Venezuela, Yemen
~ 1.7%: Lebanon, Saudi Arabia
~ 0.1%: Thailand, Turkey

Thailand and Turkey are clear outliers (likely no epidemic). Later, we provide more information that supports this hypothesis. We refer to them as "non-virus" locations and to all others as "virus" locations. We note that a mortality rate of ~ 0.1% is expected without the virus infection. Therefore, almost all patients (94 to 97%) who died in "virus" locations were victims of the virus.
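The 94 to 97% range can be reproduced with a small calculation. The following is a minimal sketch, assuming the baseline share of deaths is simply the "non-virus" rate divided by the observed rate; the function names are hypothetical and the rates are the rounded cluster values listed above, not the exact per-location figures.

```python
def mortality_rate(deaths, hospitalizations):
    """Overall mortality rate: death records divided by hospitalization records."""
    return deaths / hospitalizations


def virus_share_of_deaths(rate, baseline_rate):
    """Fraction of deaths in excess of the baseline ("non-virus") mortality rate.

    Assumption: without the virus, a location would show the baseline rate.
    """
    return 1.0 - baseline_rate / rate


# Rounded cluster values from Figure 1.1; ~0.1% is the "non-virus" baseline.
BASELINE = 0.001
for rate in (0.035, 0.025, 0.017):
    share = virus_share_of_deaths(rate, BASELINE)
    print(f"mortality {rate:.1%}: {share:.0%} of deaths attributed to the virus")
```

The lowest "virus" cluster (~1.7%) gives the lower bound and the highest (~3.5%) gives the upper bound.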

Figure 1.1: Mortality rates in each location. The average excludes "non-virus" locations.

Temporal patterns in the virus outbreak

We computed two time series for each location:

H_loc(d) = {number of people that were hospitalized in location loc on day d}
D_loc(d) = {number of people that died in location loc on day d}
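A minimal sketch of how such daily series could be computed is shown below. The record layout, a sequence of (location, date) pairs, and the function names are assumptions for illustration; the actual script and file format are not shown here. Day 0 is April 16, 2009, as in the caption of Figure 1.2.

```python
from collections import Counter
from datetime import date

DAY_ZERO = date(2009, 4, 16)  # day 0 of the data set
N_DAYS = 76                   # days 0..75 (April 16 to June 30)


def day_index(d):
    """Number of days between day 0 and the given date."""
    return (d - DAY_ZERO).days


def daily_counts(records, n_days=N_DAYS):
    """Count records per location and day.

    records: iterable of (location, date) pairs (hypothetical layout).
    Returns {location: [count on day 0, count on day 1, ...]}.
    """
    counts = Counter((loc, day_index(d)) for loc, d in records)
    locations = {loc for loc, _ in counts}
    return {loc: [counts[(loc, k)] for k in range(n_days)] for loc in locations}
```

Applying `daily_counts` to the hospitalization records yields H_loc(d), and applying it to the death records yields D_loc(d).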

Figure 1.2: Bottom left: the number of deaths on each day. Right: the number of hospitalizations on each day (top) and its 7-day moving average (bottom). Dates from 4-16-2009 (day 0) to 6-30-2009 (day 75) are displayed on the horizontal axes. Each curve in these diagrams represents one location. An additional curve displays the global sum. The curve specific to a location can be highlighted by brushing the location in the top left histogram. One can brush "_GLOBAL_" to highlight the curves that display global data. (Click to enlarge.)

We focus here on global patterns. Individual locations are discussed in MC2.2. The number of deaths starts to increase rapidly near day 15 (May 1). The peak of the epidemic (in terms of deaths) was on day 38 (May 24). The recovery phase lasted until approximately day 70 (June 25), although this transition is not as clearly pronounced as the outbreak.

The time series of hospitalizations oscillates rapidly. We applied a seven-day moving average to smooth the curves. The number of hospitalizations reached its maximum near day 31 (May 17). There are at least two more local maxima, near day 51 (June 6) and day 72 (June 27). These likely indicate the second and third waves of the epidemic. Consecutive maxima are ~21 days apart, which suggests an epidemic cycle of about three weeks.
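The smoothing step can be sketched as follows. Whether the window was trailing or centered is not stated, so this sketch assumes a trailing window and averages over the values available so far at the start of the series; the function name is hypothetical.

```python
def moving_average(series, window=7):
    """Trailing moving average of a daily series.

    Assumption: for the first window-1 days the average is taken over
    the days seen so far, so the output has the same length as the input.
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= series[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A seven-day window averages over one full week, which is one plausible reason for choosing it when the raw curve oscillates from day to day.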

Number of days in hospital

Using the Python script, we merged the hospitalization and death records for each location, with the patient IDs as primary keys. For patients who died in hospital, we also computed the number of days between hospitalization and death. The histograms of this data are shown in Figure 1.3. The histograms of Thailand and Turkey are remarkably different from all other locations. They represent the expected, roughly Gaussian, distributions. This supports the assumption that Thailand and Turkey avoided the epidemic. There is a very pronounced peak on day eight in the histogram of Aleppo. This indicates that the vast majority of victims in Aleppo died eight days after hospitalization. If we zoom in on the vertical axis (see the rightmost histogram in Figure 1.3) we can see that the rest of the histogram is roughly Gaussian. The histograms of other "virus" locations are very similar to that of Aleppo and are omitted to save space.
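The merge and the day count can be sketched as below. The dictionaries keyed by patient ID are an assumed in-memory representation, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date


def days_to_death(hospitalizations, deaths):
    """Join the two record sets on patient ID.

    hospitalizations: {patient_id: date of hospitalization}
    deaths:           {patient_id: date of death}
    Returns {patient_id: days between hospitalization and death}
    for patients that appear in both record sets.
    """
    return {
        pid: (deaths[pid] - admitted).days
        for pid, admitted in hospitalizations.items()
        if pid in deaths
    }


def histogram(values):
    """Frequency of each value, sorted by value."""
    return dict(sorted(Counter(values).items()))
```

For example, two patients hospitalized on May 1 and May 3 who died on May 9 and May 11 both fall into the bin for eight days.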

It would be interesting to learn how long it took for surviving patients to recover and leave hospital, but that information is not contained in the data set.

There is no significant correlation between mortality and age or gender.

Figure 1.3: Histograms of the number of days between hospitalization and death in Thailand, Turkey and Aleppo.

Symptoms

We edited the script to extract the most common words found in the syndrome descriptions. For each word, the script computes the percentage of dead and surviving patients whose records include the given symptom. The shortcomings of this approach became obvious when we attempted to find patterns in the word frequencies: the word "and" was included, and some words and their abbreviations appeared as separate entries. We filtered out "and" and replaced some common abbreviations, including "l" and "lt" with "left", "r" and "rt" with "right", "abd" with "abdominal", and "inj" with "injury". The word "pain" is a very generic symptom that often appears in expressions such as "back pain" or "abdominal pain". The data extraction preserves those combinations. We did not address several other issues, such as repeated words ("abd abd pain") and missing spaces ("vomitingdiarrhea").
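The word extraction described above can be sketched as follows. This is a simplified illustration, not the original script: it applies the listed abbreviation replacements and drops "and", but it does not preserve combinations such as "back pain". All names are hypothetical.

```python
from collections import Counter

# Replacements listed in the text above.
ABBREVIATIONS = {
    "l": "left", "lt": "left",
    "r": "right", "rt": "right",
    "abd": "abdominal",
    "inj": "injury",
}
STOP_WORDS = {"and"}


def normalize(syndrome):
    """Lower-case, split on whitespace, expand abbreviations, drop stop words."""
    words = [ABBREVIATIONS.get(w, w) for w in syndrome.lower().split()]
    return [w for w in words if w not in STOP_WORDS]


def symptom_percentages(syndromes):
    """Percentage of records that mention each word at least once."""
    counts = Counter()
    n_records = 0
    for text in syndromes:
        n_records += 1
        counts.update(set(normalize(text)))  # count each word once per record
    if n_records == 0:
        return {}
    return {word: 100.0 * c / n_records for word, c in counts.items()}
```

Running `symptom_percentages` separately on the records of dead and of surviving patients gives the two percentages per word that are compared in the scatter plots.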

Figures 1.4 and 1.5 capture the process of identifying the symptoms that appear most often in dead patients in "virus" locations. Since over 94% of the dead patients are victims of the virus, we assume those are symptoms of the infection.

Figure 1.4: Top left: symptoms are displayed on the horizontal axis, the percentage of dead patients on the vertical axis. The brush selects points that represent symptoms found in many of the dead patients. Top right: each point represents one symptom. The X and Y coordinates indicate the percentages of surviving and dead patients with this symptom. Points above the diagonal represent symptoms that are likely to cause death. Bottom left: each bar represents a location. Bottom right: each bar represents a symptom.

Figure 1.5: We are not interested in symptoms of the Thailand and Turkey records. They are excluded by the brush in the bottom left histogram. There is a cluster of points under the diagonal in the top right scatter plot. It represents "pain", a very generic symptom. The cluster in the middle represents "fever", another very generic symptom. They are excluded by brushes 3 and 4.

The most characteristic virus symptoms and the percentage of victims hospitalized with those symptoms can be inferred from the highlighted items in Figure 1.5, bottom right and top left:

vomiting: 36%
diarrhea: 17%
abdominal pain: 14%

Also worth mentioning:

back pain: 7%
nose (nose bleed): 5%

MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

The numbers of hospitalizations and deaths vary across locations, largely because of differences in population. To compare the curves, we normalized the time series computed for MC2.1 to a common scale by dividing each series by its maximum:

H_maxH_loc(d) = H_loc(d) / max(H_loc(d)) (Figure 2.3 bottom left)
D_maxD_loc(d) = D_loc(d) / max(D_loc(d)) (Figure 2.1 bottom left)

We also normalized the number of deaths by the total number of hospitalizations. These time series contain a lot of information about the evolution of the epidemic because they preserve the differences in mortality rates across locations:

D_sumH_loc(d) = D_loc(d) / sum(H_loc(d)) (Figure 2.1 bottom right)

We also computed the first derivative of D_maxD_loc(d) (Figure 2.3 top right).
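The normalizations and the derivative can be sketched in a few lines. The function names are hypothetical, and the first derivative is approximated here by the difference between consecutive days, which is an assumption about how it was discretized.

```python
def normalize_by_max(series):
    """Divide a series by its maximum (H_maxH_loc, D_maxD_loc)."""
    peak = max(series)
    if peak == 0:
        return list(series)  # all-zero series: nothing to scale
    return [v / peak for v in series]


def normalize_by_total(deaths, hospitalizations):
    """Divide daily deaths by the total number of hospitalizations (D_sumH_loc)."""
    total = sum(hospitalizations)
    return [d / total for d in deaths]


def first_difference(series):
    """Day-to-day change, a discrete stand-in for the first derivative."""
    return [b - a for a, b in zip(series, series[1:])]
```

Dividing by the maximum forces every curve to peak at 1, which also magnifies low-count curves. Dividing by total hospitalizations keeps locations with low mortality close to zero.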

Figure 2.1: Top left: Each bar represents a location. "Virus" locations (red) and the global sum (blue) are brushed. "Non-virus" locations are grey. Top right: number of deaths per day. Bottom left: normalization brings the curves to the same scale. The grey curves that belong to "non-virus" locations oscillate so wildly because they are heavily magnified by the normalization. Bottom right: number of deaths per day divided by the total number of hospitalizations. The grey curves stay close to zero.

Outbreak

We first tried to find the date of the virus outbreak in each location, which requires a definition of what constitutes an outbreak. Our best bet is to examine the daily mortality rates in "virus" locations and find when they exceed the maximum in "non-virus" locations (Figure 2.2 with Aleppo highlighted). The daily mortality rate in Aleppo exceeds the threshold on day 17 (May 3). The blue brush in the histogram was dragged through other locations to get the information in Table 2.1.

The vast majority of victims die after eight days in hospital. We therefore believe the first infected people were hospitalized eight days before the mortality rate started to increase, and we take those days as the dates of the outbreak.

Figure 2.2: Top left: "non-virus" locations are brushed in red. Aleppo is brushed in blue. Bottom right: daily mortality rates. The vertical axis is zoomed (see blue scale slider) to get a better view of the lower part of the curves. We need to find the day (value on the horizontal axis) when the blue curve exceeds the maximum of the red ones (simply point there with the mouse). The mouse position (17.137) is displayed under the middle of the horizontal axis.

Other criteria, such as the mean daily mortality rate in "non-virus" locations, could also define the start of the epidemic. That criterion would shift the outbreak dates (Table 2.1) two or three days earlier.
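We read the threshold crossing off the linked views interactively. As an illustration only, the same rule could be expressed in code as below. The function names and the input layout (one list of daily mortality rates per location) are assumptions.

```python
from datetime import date, timedelta

DAY_ZERO = date(2009, 4, 16)  # day 0 of the data set
DAYS_TO_DEATH = 8             # most victims die eight days after hospitalization


def first_day_above(series, threshold):
    """Index of the first day on which the series exceeds the threshold, or None."""
    for day, value in enumerate(series):
        if value > threshold:
            return day
    return None


def outbreak_date(virus_rates, non_virus_rates):
    """Estimate the outbreak date for one "virus" location.

    virus_rates:     daily mortality rates of the location
    non_virus_rates: list of daily mortality rate series of "non-virus" locations
    The threshold is the maximum over all "non-virus" series.
    """
    threshold = max(max(series) for series in non_virus_rates)
    day = first_day_above(virus_rates, threshold)
    if day is None:
        return None
    return DAY_ZERO + timedelta(days=day - DAYS_TO_DEATH)
```

With a crossing on day 17 (May 3), as observed for Aleppo, the function returns April 25. Replacing `max` by a mean in the threshold line would implement the alternative criterion.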

Peaks and outliers

Using a procedure very similar to the one described above, we selected individual locations and looked for the maxima in the linked daily mortality rate curves. They indicate the peaks of the epidemic in terms of the number of victims. Some curves have several pronounced local maxima, so we examined the first derivatives. The snapshot of this process (Figure 2.3) captures some interesting outliers. In Iran, the mortality rate had a local maximum on May 13, decreased slightly, and increased again two days later. The mortality rate increased significantly in Venezuela between June 26 and 28, and in Saudi Arabia on the last day in the data set.
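We located the maxima visually. For readers who prefer a programmatic check, a minimal sketch of finding local maxima in a daily series follows; the function name is hypothetical and the comparison rule (strictly higher than the previous day, not lower than the next) is an assumption.

```python
def local_maxima(series):
    """Indices of interior days that are higher than the previous day
    and not lower than the following day."""
    return [
        i
        for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] >= series[i + 1]
    ]
```

On noisy curves this reports many small peaks, so it is best applied to smoothed series or combined with a minimum height.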

Figures 2.4 and 2.5 show two clusters in the shapes of the daily mortality rates.

Figure 2.3: Top right: first derivative of the mortality rate with outliers brushed. Bottom left: normalized number of hospitalizations. Bottom right: number of deaths per day divided by the total number of hospitalizations.

Figure 2.4: The meaning of the views is the same as in Figure 2.3. The selection in the top right is different. Curves that cross the line brush (thick black line) are selected. The shapes of the mortality rate curves (bottom right) in Colombia, Iran, Lebanon, Saudi Arabia and Venezuela are similar.

Figure 2.5: Similar to Figure 2.4, but curves that do not cross the same brush are selected. Aleppo, Karachi, Nairobi and Yemen are also similar.

Number of people infected

We can only infer the number of people hospitalized because of the virus infection. There is no information about the infected population.

Our calculations are based on the numbers of patients with the most characteristic symptoms. We found approximately the same frequency of symptoms in all locations (virus and non-virus) during the first seven days:

vomiting: 4.8%
diarrhea: 2.1%
abdominal pain: 2.6%

We used those numbers as reference values before the epidemic. The number of patients hospitalized with vomiting during the first seven days can be extrapolated to estimate how many would have been hospitalized in the next 69 days, had there been no epidemic. The difference between the actual number after day seven and this extrapolated baseline is the number of patients hospitalized with vomiting because of the epidemic. Let n denote that number. We know p, the percentage of infected patients with the symptom “vomiting” after the virus outbreak. The number of people hospitalized with virus infection is n/p. We get similar results with the other two symptoms.
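The estimate can be written as a short function. This is a sketch of the calculation described above, not the Excel sheet we used; the function and parameter names are hypothetical, and p is passed as a fraction (0.36 for 36%).

```python
def estimate_hospitalized_by_virus(baseline_count, later_count, p,
                                   baseline_days=7, later_days=69):
    """Estimate the number of people hospitalized because of the virus.

    baseline_count: patients with the symptom during the first seven days
    later_count:    patients with the symptom during the remaining 69 days
    p:              fraction of infected patients showing the symptom
    """
    # Linear extrapolation of the pre-epidemic level to the later period.
    expected_without_epidemic = baseline_count * later_days / baseline_days
    # n: symptom cases attributed to the epidemic.
    n = later_count - expected_without_epidemic
    return n / p
```

With made-up numbers, 70 cases in the first week extrapolate to 690 cases over 69 days. If 4290 cases were actually observed, n = 3600, and with p = 0.36 the estimate is about 10000 hospitalized infections.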

We similarly estimated the increase in the number of deaths because of the epidemic. The mortality rate of the epidemic is the number of deaths related to the epidemic divided by the number of people hospitalized because of the epidemic (Table 2.1).

Location	Increase in mortality	Outbreak	Peak in mortality	Number of infections	Mortality rate of the epidemic
Aleppo	May 3	Apr 25	May 23	556586	14.0%
Colombia	May 6	Apr 28	May 28	137727	11.7%
Iran	May 5	Apr 27	May 27	80340	14.7%
Karachi	May 4	Apr 26	May 25	1211790	13.6%
Lebanon	May 3	Apr 25	May 25	37920	19.8%
Nairobi	May 1	Apr 23	May 22 - May 23	328862	13.2%
Saudi Arabia	May 4	Apr 26	May 25 - May 26	152004	14.0%
Venezuela	May 4	Apr 26	May 27	33286	11.1%
Yemen	May 3	Apr 25	May 24 - May 25	63451	12.0%

Table 2.1: Summary of the virus outbreak across cities.
